Abstract - Data mining is one of the most active research topics in
computer science. The healthcare industry gathers massive amounts of
data that are not “mined” to uncover unknown information. Data mining
tools and techniques can be applied productively in many fields.
Nowadays many people suffer from various kinds of diseases, and the
medical industry comes across new treatments and medicines every day.
It must deliver sound diagnoses and remedies to patients in order to
achieve a good quality of service. Most organizations use data mining
as a powerful tool for data analysis. The primary purpose of this
paper is to provide detailed information about which tools and
techniques are used to determine the accuracy level of prediction for
various diseases, and about data mining applications. A data mining
based prediction system reduces human effort and is cost effective.
Various open source software (OSS) data mining tools are also
discussed.


Keywords - Data Mining; Concepts; Tools and Techniques; Algorithms;
Health Data Analysis; Data Mining Applications; Classification;
Clustering; Association.


1. INTRODUCTION 

Data mining is also known as knowledge discovery from data (KDD). Its
purpose is to extract useful information from huge databases or data
warehouses. Nowadays data mining is becoming common in the healthcare
field because there is a need for an effective analytical methodology
for detecting unknown and valuable information in health data. In the
health industry, data mining offers numerous benefits, such as
detection of fraud in health insurance, availability of medical
solutions to patients at lower cost, detection of the causes of
diseases and identification of medical treatment methods. Data mining
algorithms are useful in the healthcare industry and play an important
role in the prediction and diagnosis of diseases[1]. A large number of
data mining applications are found in medical areas such as the
medical device industry, the pharmaceutical industry, hospital
management and health sector management. The aim of applying data
mining is to extract valuable and previously unknown information from
the database. Knowledge discovery is an interactive process that
consists of developing an understanding of the application domain,
selecting and creating a data set, preprocessing and data
transformation[2].

The data generated by health organizations is very large and complex,
which makes it hard to analyze in order to reach important conclusions
regarding patient health. This data covers details regarding
hospitals, patients, medical claims, treatment costs and so on. There
is therefore a need for a powerful tool for analyzing and extracting
significant information from this complex data. The analysis of health
data improves healthcare by enhancing the performance of patient
management tasks[3]. Data mining technologies benefit healthcare
organizations by grouping patients who have similar diseases or health
issues so that the organization can provide them with effective
treatment. They are also valuable for predicting the length of stay of
patients in hospital, for medical diagnosis and for planning effective
information system management. New and current technologies are used
in the medical field to improve medical services in a cost effective
manner[4].


2. Classification of Data Mining System 

Data mining systems can be categorized according to various criteria,
as given below:

1) Type of data sources mined 

2) Database involved 

3) Kind of knowledge discovered 

4) Mining techniques used 


3. Process of Data Mining 

The data mining process includes the following steps:

1) Data Cleaning: It is used to remove noise 
and inconsistent data. 

2) Data Integration: It is used to combine 
multiple data sources. 

3) Data Selection: It is used to retrieve the 
relevant data from the database for analysis task. 

4) Data Transformation: It is used to transform or consolidate data
into a form appropriate for mining by performing summary or
aggregation operations.

5) Data Mining: Here the intelligent methods 
are applied in order to extract data patterns. 

6) Pattern Evaluation: It is used to evaluate the 
data patterns. 

7) Knowledge Presentation: Here the 
knowledge is represented. 

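The first steps of this process can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the patient records, the field names (`id`, `age`, `glucose`) and the helper functions are hypothetical and are not taken from any tool discussed in this paper.

```python
# Hypothetical patient records from two sources (all values are made up).
clinic_a = [
    {"id": 1, "age": 54, "glucose": 148.0},
    {"id": 2, "age": None, "glucose": 85.0},  # missing age: an inconsistent record
]
clinic_b = [
    {"id": 3, "age": 61, "glucose": 183.0},
    {"id": 4, "age": 33, "glucose": 89.0},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: rescale one field to the range [0, 1].

    Assumes the field takes at least two distinct values.
    """
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in records]

def mine(records, field, threshold):
    """Data mining (toy rule): keep records whose scaled value exceeds a threshold."""
    return [r for r in records if r[field] > threshold]

integrated = integrate(clean(clinic_a), clean(clinic_b))
prepared = transform(select(integrated, ["age", "glucose"]), "glucose")
high_glucose = mine(prepared, "glucose", 0.5)
```

Each function corresponds to one step of the list above; pattern evaluation and knowledge presentation would then operate on `high_glucose`.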

Figure 1 shows the data mining process.

Figure 1: Data Mining steps


VOL.13 No.03 MAR 2023 


3.1. Data mining techniques 

A large number of data mining techniques have been developed and used
in data mining projects in recent years. Some of them are described
below.


3.1.1 Association 

Association is one of the best-known data mining techniques. In
association, a pattern is discovered based on a relationship between
items in the same transaction. For that reason the association
technique is also known as the relation technique. It is used in
market basket analysis to identify sets of products that customers
frequently purchase together[5]. Retailers use the association
technique to study customers' buying habits. Based on historical sales
data, a retailer might find that customers always buy jam when they
buy bread, and can therefore place jam and bread next to each other to
save customers time and increase sales.
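The bread and jam example can be made concrete with the two measures commonly used to rate an association rule, support and confidence. The sketch below is illustrative only: the transactions are invented, and the function names are hypothetical rather than part of any tool named in this paper.

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "jam", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "jam"}, transactions)
rule_confidence = confidence({"bread"}, {"jam"}, transactions)
```

With this toy data, bread and jam appear together in three of five baskets, and about three quarters of the baskets that contain bread also contain jam.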


3.1.2 Classification 

Classification is a common data mining technique based on machine
learning. Essentially, classification is used to assign each item in a
set of data to one of a predefined set of classes or groups[6].

Classification makes use of a variety of mathematical techniques such
as decision trees, linear programming, neural networks and statistics.
In classification, we develop software that can learn how to classify
data items into groups[6]. For instance, classification can be applied
to the problem “given all records of employees who left the company,
predict who will probably leave the company in a future period.” In
this case, we divide the employee records into two groups named
“leave” and “stay”, and then ask the data mining software to classify
the employees into these groups[7].
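As a minimal sketch of the “leave” versus “stay” example, the code below uses a nearest-centroid classifier, a simple statistical method chosen here for brevity; it is not claimed to be the method used in the cited work. The two features (years at the company and a satisfaction score) and all values are hypothetical.

```python
from math import dist

# Hypothetical training data: (years_at_company, satisfaction 0-10) -> label.
training = [
    ((1.0, 3.0), "leave"),
    ((2.0, 2.0), "leave"),
    ((1.5, 4.0), "leave"),
    ((6.0, 8.0), "stay"),
    ((8.0, 7.0), "stay"),
    ((7.0, 9.0), "stay"),
]

def fit_centroids(samples):
    """Learn one centroid (mean feature vector) per class label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(features, centroids):
    """Assign the label whose centroid is closest to `features`."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

centroids = fit_centroids(training)
prediction = classify((2.5, 3.5), centroids)  # a new, unlabeled employee
```

The classes are fixed in advance by the labels in the training data, which is what distinguishes classification from clustering.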


3.1.3 Clustering 

Clustering is a data mining technique that automatically forms
meaningful or useful clusters of objects that have similar
characteristics. The clustering technique defines the classes and puts
objects in each class, while in the classification techniques,


www. ijitce.co.uk 


INTERNATIONAL JOURNAL OF INNOVATIVE TECHNOLOGY AND CREATIVE ENGINEERING (ISSN:2045-8711) 


objects are assigned to predefined classes. To make the concept
clearer, take book management in a library as an example. A library
holds a wide variety of books on different topics. The challenge is
how to keep those books in a way that lets readers take several books
on a particular topic without trouble. By using the clustering
technique, we can keep books that have some kind of similarity in one
cluster, or one bookshelf, and label it with a meaningful name.
Readers who want books on that topic only have to go to that bookshelf
instead of searching the whole library.
Two kinds of distance are used to describe clusters in data mining:

i. Intercluster distance

ii. Intracluster distance

Intercluster Distance:
Intercluster distance is the distance between two objects belonging to
two different clusters.

Intracluster Distance:
Intracluster distance is the distance between two objects belonging to
the same cluster.
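The two distances can be illustrated with a short sketch. It assumes Euclidean distance and averages over all object pairs, which is one common convention rather than the only one; the points and function names are hypothetical.

```python
from itertools import combinations, product
from math import dist

# Hypothetical 2-D points already grouped into two clusters.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

def mean_intracluster_distance(cluster):
    """Average distance between pairs of objects in the same cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_intercluster_distance(first, second):
    """Average distance between objects drawn from two different clusters."""
    pairs = list(product(first, second))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra_a = mean_intracluster_distance(cluster_a)
inter_ab = mean_intercluster_distance(cluster_a, cluster_b)
```

A good clustering generally shows small intracluster distances and large intercluster distances, as is the case for these two well-separated groups.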


3.1.4 Prediction 

Prediction, as its name implies, is a data mining technique that
discovers the relationship among independent variables and the
relationship between dependent and independent variables. For
instance, prediction analysis can be used in sales to estimate future
income: if sales are treated as the independent variable, income is
the dependent variable. Based on historical sales and income data, we
can then draw a fitted regression curve that is used for profit
prediction[8].
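A straight line fitted by ordinary least squares is the simplest form of such a regression curve. The sketch below assumes a linear relationship; the sales and income figures are invented for illustration.

```python
# Hypothetical historical data: sales (independent) and income (dependent).
sales = [10.0, 20.0, 30.0, 40.0, 50.0]
income = [25.0, 45.0, 65.0, 85.0, 105.0]

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(sales, income)
forecast = slope * 60.0 + intercept  # predicted income for a sales value of 60
```

Because the toy data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 5; real data would scatter around the fitted line.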


3.1.5 Sequential Patterns 

Sequential pattern analysis is a data mining technique that seeks to
discover or identify similar patterns, regular events or trends in
transaction data over a business period. In sales, with historical
transaction data, businesses can identify sets of items that customers
buy together at different times of the year. Businesses can then use
this information to recommend those items to customers with better
deals, based on their past purchasing frequency[9].
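A very small instance of this idea is to count how many customers bought one item before another. The sketch below is a simplified illustration, not a full sequential pattern mining algorithm; the purchase histories and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each ordered by time.
histories = [
    ["printer", "ink", "paper"],
    ["printer", "paper", "ink"],
    ["laptop", "mouse"],
    ["printer", "ink"],
]

def frequent_ordered_pairs(sequences, min_count):
    """Count pairs (a, b) where a is bought before b, once per customer."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so a always precedes b in seq.
        counts.update(set(combinations(seq, 2)))
    return {pair: n for pair, n in counts.items() if n >= min_count}

patterns = frequent_ordered_pairs(histories, min_count=3)
```

Here the only pattern supported by at least three customers is "printer, then ink", which a business could use to time an offer on ink.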


3.1.6 Decision trees

The decision tree is one of the most commonly used classification
techniques in data mining because its hierarchical model is easy for
users to understand. In the decision tree technique, the root of the
tree is a simple question or condition that has multiple answers. Each
answer then leads to a further set of questions or conditions that
help narrow down the data so that a final decision can be made. For
example, the following decision tree determines whether a person is
eligible to vote or not[10].

Two or more of these data mining techniques are often combined to form
an appropriate process that meets the business requirements.

Decision trees have three main parts: a root 
node, leaf nodes and branches. The root node is 
the starting point of the tree, and both root and leaf 
nodes contain questions or criteria to be 
answered. Branches are arrows connecting 
nodes, showing the flow from question to answer. 
Each node typically has two or more nodes extending from it. For
example, if the question in the first node requires a "yes" or "no"
answer, there will be one leaf node for a "yes" response and another
node for "no".


[Figure 2: a root node "Age >= 18" with two branches leading to the
outcomes "Eligible for the vote" and "Not eligible for the vote"]


Fig.2 Decision Tree
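The tree in Fig. 2 can be written down directly as a small data structure. The sketch below is illustrative only: the dictionary keys (`question`, `yes`, `no`) and the `decide` function are hypothetical names chosen for this example.

```python
# The voting tree as nested dictionaries; the key names are illustrative.
voting_tree = {
    "question": lambda person: person["age"] >= 18,  # root node condition
    "yes": "Eligible for the vote",                  # leaf for a "yes" answer
    "no": "Not eligible for the vote",               # leaf for a "no" answer
}

def decide(node, person):
    """Walk from the root along the answered branches until a leaf is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["question"](person) else "no"
        node = node[branch]
    return node

result = decide(voting_tree, {"age": 21})
```

A deeper tree is obtained by replacing a leaf string with another dictionary of the same shape; `decide` then keeps following branches until it reaches a string.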


3.1.8 Decision Tree uses 
A decision tree can be used in either a predictive or a descriptive
manner. In either case it is constructed in the same way and is used
to visualize all possible outcomes and decision points that occur
chronologically[12]. Decision trees are most commonly used in the
financial world in areas such as loan approval, portfolio management
and spending. A decision tree can also be helpful when examining the
viability of a new product or defining a new market for an existing
product[13].


4. Data Mining Tools
Various data mining tools are listed below for reference:
1) Artificial Neural Networks (ANN)
2) Rough Set Theory (RST)
3) Statistical Package for the Social Sciences (SPSS) Modeler
4) K-means clustering
5) Single Nucleotide Polymorphism (SNP)
6) Orange
7) SAS mining tools
8) Rattle
9) RapidMiner
10) DataMelt

The eight best open source data mining tools include the following:
1) Teradata
2) Weka
3) KNIME
4) Natural Language Toolkit (NLTK)
5) Apache Mahout

R analytics is data analytics using the R programming language, an
open-source language used for statistical computing and graphics. R is
often used in statistical analysis and data mining. It can be used to
identify patterns and build practical models.

RapidMiner

RapidMiner is one of the best predictive analysis systems. It was
developed by the company of the same name and is written in the Java
programming language. It provides an integrated environment for deep
learning.

Weka

Weka was developed at the University of Waikato in New Zealand. It is
best suited for data analysis and predictive modeling, and it contains
algorithms and visualization tools that support machine learning.

Weka has a GUI that gives easy access to all its features. It is
written in the Java programming language.




KNIME 

KNIME is one of the best integration platforms for data analytics and
reporting, developed by KNIME.com AG. It operates on the concept of
the modular data pipeline and consists of various machine learning and
data mining components embedded together.

It has been used for pharmaceutical research. In addition, it is used
for customer data analysis and financial data analysis.

Sisense 

Sisense is a very useful BI software package that is best suited to
reporting purposes within an organization. It is developed by the
company of the same name, ‘Sisense’. It has a strong capability to
handle and process data for both small-scale and large-scale
organizations.

SSDT 

SQL Server Data Tools (SSDT) is a universal, declarative model that
extends to all the phases of database development in the Visual Studio
IDE. It was developed to support data analysis and provide business
intelligence solutions. Developers use the Transact-SQL design
capabilities of SSDT to design and refactor databases.

Apache Mahout is a project developed by the Apache Software Foundation
whose primary purpose is to create machine learning algorithms. It
focuses mainly on data clustering, classification and collaborative
filtering.

Mahout is written in Java and includes Java libraries for mathematical
operations such as linear algebra and statistics. Mahout keeps growing
as more algorithms are implemented in it. Its algorithms are
implemented on top of Hadoop.

Oracle 

Oracle Data Mining (ODM), a component of Oracle Advanced Analytics,
provides excellent data mining algorithms.

The algorithms in ODM leverage the strengths of the Oracle database.
Its SQL data mining features can dig data out of database tables,
views and schemas.

Rattle

Rattle is a GUI tool that uses the R statistical programming language.
Rattle exposes the statistical power of R by providing considerable
data mining functionality. It has an extensive and well-developed UI,
and it also has a built-in log tab that records the equivalent R code
for any activity performed in the GUI.

DataMelt, also known as DMelt, is a computation and visualization
environment that provides an interactive framework for data analysis
and visualization. It is designed mainly for engineers, scientists and
students.

DMelt is a multi-platform utility. It can run on any operating system
that is compatible with the JVM (Java Virtual Machine).

IBM Cognos BI is a business intelligence suite. It consists of
sub-components that meet specific organizational requirements.

Cognos Connection: A web portal to gather and summarize data in
scoreboards/reports.

Orange

Orange is a software suite for machine learning and data mining. It is
component-based and is particularly good at data visualization.

The components of Orange are called ‘widgets’. Widgets offer major
functionalities such as:
• Showing a data table and allowing features to be selected
• Reading the data
• Training predictors and comparing learning algorithms
• Visualizing data elements


5. Data Mining Applications in Healthcare 

Data mining tools are used to predict outcomes from data collected on
healthcare problems. Different data mining tools are used to calculate
the accuracy level for different healthcare problems. The following
medical problems have been evaluated and assessed[14]:
Heart Disease
Cancer
HIV/AIDS
Blood
Brain Cancer
Tuberculosis
Diabetes Mellitus
Kidney dialysis
Dengue
IVF
Hepatitis C


KANNAN 
S.No | Type of Diseases     | Data Mining Tools | Techniques       | Algorithms      | Accuracy (%)
1    | Heart Diseases       | H2O               | Classification   | Naive           | 72.5
2    | Cancer               | Weka              | Classification   | Decision Table  | 95.0
3    | [illegible] Diseases | R 0.4.5           |                  | Decision making | 39 4
4    | Heart Diseases       | H2O               | Classification   | Naive           | 80
5    | Dengue               | SPSS              | Classification   | C 5.0           | 91
6    | Brain Cancer         | K-Means           | Clustering       | J-48            | 73.2
7    | Hepatitis C          | Kaine p           | Information Gain | Decision Rule   | 85

Table 1: Tools and Techniques in Healthcare

Table 1 shows the various tools and techniques used to find the
accuracy level for various diseases.
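The paper does not state how the accuracy figures were computed. Assuming accuracy means the percentage of cases for which the predicted label matches the known label, it can be sketched as follows; the labels are invented and the function name is hypothetical.

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the known labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical labels for ten patients (1 = disease present, 0 = absent).
actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
score = accuracy(predicted, actual)
```

In this toy case eight of the ten predictions are correct, so the score is 80.0.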


6. Conclusion 

This research paper mainly focuses on various OSS tools and mining
algorithms. The prediction of diseases using data mining applications
is a challenging task, but it greatly reduces human effort and
increases diagnostic accuracy. Developing efficient data mining tools
for an application could decrease the cost and time required in terms
of human resources and expertise. Discovering knowledge from medical
data is a complicated job because the data are noisy, irrelevant and
massive. In this scenario, data mining tools come in handy for
discovering knowledge from medical data. It is observed from Table 1
that a combination of more than one data mining technique, rather than
a single technique, could produce more encouraging results for
diagnosing or predicting diseases in the healthcare sector. Table 1
shows that data mining techniques offer a promising level of accuracy
in the healthcare applications considered, such as 95.77% for cancer
prediction and around 80% for estimating the success rate of IVF
treatment.